In [9]:

    
# Note to presentation mode: run the first two cells before presenting.
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier
pd.set_option('display.mpl_style', 'default') 
plt.rcParams['figure.figsize'] = (18, 5)









    



/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2862: FutureWarning: 
mpl_style had been deprecated and will be removed in a future version.
Use `matplotlib.pyplot.style.use` instead.

  exec(code_obj, self.user_global_ns, self.user_ns)



In [10]:

    
%%HTML
<!-- Not content, but for better presentation styles -->
<style>
.rise-enabled .rendered_html table.dataframe {
    font-size: 16px;
}
.rendered_html code {
    background: #EFEFEF;
    border: 1px solid #CCC;
    padding: 1px 5px;
    border-radius: 1px;
}
.rise-enabled .fa-times-circle:before,
.rise-enabled .fa-question:before,
.rise-enabled aside.controls {
    opacity: 0.1;
}
.rise-enabled a,
.rise-enabled a:active,
.rise-enabled a:hover,
.rise-enabled a:visited,
.rise-enabled :link {
    text-decoration: none;
    color: #A68910;
}
</style>

Messy Sensor Data:

A Programmer's Cleaning Guide

@Xavier_Ho, #pyconau

*Feel free to tag me on Twitter for questions and comments.*

This talk is a tutorial (thanks @evildmp!) for cleaning up messy data with pandas. You might also find these other ones useful:

Using Python in a Data Hackathon - Tennessee Leeuwenburg
Visualising data with Python - Clare Sloggett

Slides will be available online.

There will be time for questions at the end - happy to navigate specific problems you may have.

Motivation: healthy water

Aquatic microorganisms thrive and die with seasonal temperatures.

Cyanobacteria, or "blue-green algae", produces oxygen in water.

Too much cyanobacteria in water is harmful to consume.

We model cyanobacteria in water with weather data.

In short: we can predict and track cyanobacteria, and keep our water healthy.

Let's take a look at our weather sensor data...

In this talk, we use pandas for data wrangling.

You can install it with your favourite package manager (instructions).

$ conda install pandas

To begin:

pandas.read_csv(): Opens a CSV file as a DataFrame, like a table.
DataFrame.head(): Displays the first 5 entries.

Read a CSV file



In [11]:

    
import pandas as pd

# Open a comma-separated values (CSV) file as a DataFrame
weather_observations = pd.read_csv('observations/Canberra_observations.csv')

# Print the first 5 entries
weather_observations.head()









    Out[11]:







  
    
      
      Date	Time	Wind dir	Wind spd	Wind gust	Tmp	Dew pt	Feels like	rh	Fire	Rain	Rain 10'	Pres
    
  
  
    
      0
      01/01/2013\tWed 00:00 EDT\tWNW\t9\t9\t19.6\t0....
    
    
      1
      01/01/2013\tTue 23:30 EDT\tWNW\t-\t13\t20.7\t0...
    
    
      2
      01/01/2013\tTue 23:00 EDT\tW\t9\t11\t22.8\t0.9...
    
    
      3
      01/01/2013\tTue 22:30 EDT\tWNW\t11\t13\t22.6\t...
    
    
      4
      01/01/2013\tTue 22:00 EDT\tWNW\t15\t19\t23.9\t...

The file seems to be tab seperated. There are dates, and some empty items.

Can we read it more clearly?

pandas.read_csv() is very versatile with keyword arguments.

sep: The separator between columns.

parse_dates: Treat one or more columns like dates.
dayfirst: Use DD.MM.YYYY format, not month first.
infer_datetime_format: Tell pandas to guess the date format.

na_values: Specify values to be treated as empty.

Read a CSV file the 🐼🐼 way



In [12]:

    
# Supply pandas with some hints about the file to read
weather_observations = pd.read_csv('observations/Canberra_observations.csv',
#                                    sep='\t',
#                                    parse_dates={'Datetime': ['Date', 'Time']},
#                                    dayfirst=True,
#                                    infer_datetime_format=True,
#                                    na_values=['-'],
                                  )
# Display some entries
weather_observations.head()









    Out[12]:







  
    
      
      Date	Time	Wind dir	Wind spd	Wind gust	Tmp	Dew pt	Feels like	rh	Fire	Rain	Rain 10'	Pres
    
  
  
    
      0
      01/01/2013\tWed 00:00 EDT\tWNW\t9\t9\t19.6\t0....
    
    
      1
      01/01/2013\tTue 23:30 EDT\tWNW\t-\t13\t20.7\t0...
    
    
      2
      01/01/2013\tTue 23:00 EDT\tW\t9\t11\t22.8\t0.9...
    
    
      3
      01/01/2013\tTue 22:30 EDT\tWNW\t11\t13\t22.6\t...
    
    
      4
      01/01/2013\tTue 22:00 EDT\tWNW\t15\t19\t23.9\t...

Note: NaN in the table above means empty, not the floating-point number value.



In [13]:

    
# For consistency between slides
weather_observations = pd.read_csv('observations/Canberra_observations.csv',
                                   sep='\t',
                                   parse_dates={'Datetime': ['Date', 'Time']},
                                   dayfirst=True,
                                   infer_datetime_format=True,
                                   na_values=['-']
                                  )

Small steps in data cleaning

The data seems complete, but still has some issues:

Each day includes midnight, and another midnight next day.
Order starts at end of day and goes backwards in time.

pandas offers some functions to help us out:

DataFrame.drop_duplicates(): Delete duplicated items.
DataFrame.sort_values(): Rearrange in order.
DataFrame.set_index(): Specify a column to use as index.

Keep order in our data



In [14]:

    
# Remove duplicated items with the same date and time
no_duplicates = weather_observations.drop_duplicates('Datetime', keep='last')

# Sorting is ascending by default, or chronological order
# sorted_dataframe = no_duplicates.sort_values('Datetime')

# Use `Datetime` as our DataFrame index
# indexed_weather_observations = sorted_dataframe.set_index('Datetime')

# Display some entries
no_duplicates.head()









    Out[14]:







  
    
      
      Datetime
      Wind dir
      Wind spd
      Wind gust
      Tmp
      Dew pt
      Feels like
      rh
      Fire
      Rain
      Rain 10'
      Pres
    
  
  
    
      1
      2013-01-01 23:30:00
      WNW
      NaN
      13.0
      20.7
      0.7
      20.7
      26.0
      13.0
      0.0
      0.0
      1011.1
    
    
      2
      2013-01-01 23:00:00
      W
      9.0
      11.0
      22.8
      0.9
      22.8
      23.0
      15.0
      0.0
      0.0
      1011.0
    
    
      3
      2013-01-01 22:30:00
      WNW
      11.0
      13.0
      22.6
      -1.0
      22.6
      21.0
      17.0
      0.0
      0.0
      1011.0
    
    
      4
      2013-01-01 22:00:00
      WNW
      15.0
      19.0
      23.9
      -2.2
      23.9
      18.0
      21.0
      0.0
      0.0
      1011.1
    
    
      5
      2013-01-01 21:30:00
      W
      13.0
      17.0
      24.0
      -2.0
      24.0
      18.0
      20.0
      0.0
      0.0
      1011.0



In [15]:

    
# For consistency
no_duplicates = weather_observations.drop_duplicates('Datetime', keep='last')
sorted_dataframe = no_duplicates.sort_values('Datetime')
indexed_weather_observations = sorted_dataframe.set_index('Datetime')

Let wind be numbers

To process wind direction, we need them in numbers.

North wind (↓) is 0 degrees, going clockwise ⟳.
East wind (←) is 90 degrees.

We need to transform each direction label "N", "NNE", "NE", "ENE", "E", etc., to numbers on the column.

Series.apply(): Transforms each entry with a function.

Transforming DataFrames



In [16]:

    
# Translate wind direction to degrees
wind_directions = {
     'N':   0. , 'NNE':  22.5, 'NE':  45. , 'ENE':  67.5 ,
     'E':  90. , 'ESE': 112.5, 'SE': 135. , 'SSE': 157.5 ,
     'S': 180. , 'SSW': 202.5, 'SW': 225. , 'WSW': 247.5 ,
     'W': 270. , 'WNW': 292.5, 'NW': 315. , 'NNW': 337.5 }



In [17]:

    
# Replace wind directions column with a new number column
# `get()` accesses values safely from dictionary
indexed_weather_observations['Wind dir'] = \
    indexed_weather_observations['Wind dir'].apply(wind_directions.get)
    
# Display some entries
indexed_weather_observations.head()









    Out[17]:







  
    
      
      Wind dir
      Wind spd
      Wind gust
      Tmp
      Dew pt
      Feels like
      rh
      Fire
      Rain
      Rain 10'
      Pres
    
    
      Datetime
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013-01-01 00:00:00
      67.5
      2.0
      6.0
      15.9
      3.7
      15.9
      44.0
      5.0
      0.0
      0.0
      1014.1
    
    
      2013-01-01 00:30:00
      NaN
      0.0
      0.0
      14.5
      4.7
      14.5
      52.0
      3.0
      0.0
      0.0
      1014.0
    
    
      2013-01-01 01:00:00
      NaN
      0.0
      0.0
      12.8
      4.6
      12.8
      57.0
      3.0
      0.0
      0.0
      1013.5
    
    
      2013-01-01 01:30:00
      202.5
      6.0
      9.0
      13.3
      5.9
      13.3
      61.0
      3.0
      0.0
      0.0
      1013.4
    
    
      2013-01-01 02:00:00
      225.0
      2.0
      7.0
      12.5
      4.7
      12.5
      59.0
      3.0
      0.0
      0.0
      1013.2

Irregular data frequency

Our data has odd timestamps from time to time, For example:



In [18]:

    
# One section where the data has weird timestamps ...
indexed_weather_observations[1800:1806]









    Out[18]:







  
    
      
      Wind dir
      Wind spd
      Wind gust
      Tmp
      Dew pt
      Feels like
      rh
      Fire
      Rain
      Rain 10'
      Pres
    
    
      Datetime
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013-02-04 00:30:00
      67.5
      9.0
      15.0
      15.1
      12.6
      15.1
      85.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 00:33:00
      67.5
      7.0
      11.0
      15.0
      12.7
      15.0
      86.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:00:00
      67.5
      6.0
      9.0
      14.9
      12.8
      14.9
      87.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:11:00
      45.0
      4.0
      7.0
      15.0
      12.9
      15.0
      87.0
      1.0
      0.0
      0.0
      1020.1
    
    
      2013-02-04 01:30:00
      67.5
      6.0
      7.0
      15.4
      12.9
      15.4
      85.0
      1.0
      0.0
      0.0
      1020.0
    
    
      2013-02-04 02:00:00
      NaN
      0.0
      0.0
      14.7
      13.1
      14.7
      90.0
      1.0
      0.0
      0.0
      1019.9

We can solve this. DataFrame.asfreq() forces a specific frequency on the index, discarding and filling the rest.

Setting data frequency

Let's set each data point to be every 30 minutes.



In [19]:

    
# Force the index to be every 30 minutes
regular_observations = indexed_weather_observations.asfreq('30min')
    
# Same section at different indices since setting its frequency :)
regular_observations[1633:1638]









    Out[19]:







  
    
      
      Wind dir
      Wind spd
      Wind gust
      Tmp
      Dew pt
      Feels like
      rh
      Fire
      Rain
      Rain 10'
      Pres
    
    
      Datetime
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013-02-04 00:30:00
      67.5
      9.0
      15.0
      15.1
      12.6
      15.1
      85.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:00:00
      67.5
      6.0
      9.0
      14.9
      12.8
      14.9
      87.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:30:00
      67.5
      6.0
      7.0
      15.4
      12.9
      15.4
      85.0
      1.0
      0.0
      0.0
      1020.0
    
    
      2013-02-04 02:00:00
      NaN
      0.0
      0.0
      14.7
      13.1
      14.7
      90.0
      1.0
      0.0
      0.0
      1019.9
    
    
      2013-02-04 02:30:00
      NaN
      0.0
      0.0
      14.6
      13.0
      14.6
      90.0
      1.0
      0.0
      0.0
      1019.9

Plotting some data for a better picture



In [20]:

    
# Plot the first 500 entries with selected columns
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x10a25ab00>

There seems to be gaps in our dataset between 4th Jan and 9th Jan.

Gaps in the data

Weather sensors sometimes drop out.

But for modelling purposes, we need a gap-free dataset.

Series.interpolate(): Fill in empty values based on index.

Interpolate and fill empty rows



In [21]:

    
# Interpolate data to fill empty values
for column in regular_observations.columns:
    regular_observations[column].interpolate('time', inplace=True)

# Display some interpolated entries    
regular_observations[1633:1638]









    Out[21]:







  
    
      
      Wind dir
      Wind spd
      Wind gust
      Tmp
      Dew pt
      Feels like
      rh
      Fire
      Rain
      Rain 10'
      Pres
    
    
      Datetime
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013-02-04 00:30:00
      67.5
      9.0
      15.0
      15.1
      12.6
      15.1
      85.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:00:00
      67.5
      6.0
      9.0
      14.9
      12.8
      14.9
      87.0
      1.0
      0.0
      0.0
      1020.2
    
    
      2013-02-04 01:30:00
      67.5
      6.0
      7.0
      15.4
      12.9
      15.4
      85.0
      1.0
      0.0
      0.0
      1020.0
    
    
      2013-02-04 02:00:00
      67.5
      0.0
      0.0
      14.7
      13.1
      14.7
      90.0
      1.0
      0.0
      0.0
      1019.9
    
    
      2013-02-04 02:30:00
      67.5
      0.0
      0.0
      14.6
      13.0
      14.6
      90.0
      1.0
      0.0
      0.0
      1019.9

Plotting to be sure!



In [22]:

    
# Plot it again - gap free!
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b0c1390>

Now the dataset is ready to be used for modelling! 🎉

You too can keep our water healthy 🐹

Today you learned how to use pandas to in many ways:

Reading a CSV file with proper structures
Sorting your dataset
Transforming columns by applying a function
Regulate data frequency from odd ones
Interpolate and fill missing data
Plotting your dataset

pandas is much more powerful than what we covered today. Check out the documentation! You might find some gems.